Excel reports are everywhere you look. The reports are generated by accountants, sales and engineering teams, statistical bureaus and municipalities to name a few. These reports are organized and formatted with the intention of making it easier to read and understand the report. However, there is no standard way of generating Excel report. Depending on the nature of the report producer and the workflow of a given organization, Excel reports vary and often contains several rows and columns of numbers which can be challenging to comprehend.
As discussed on my last blog how to transform Excel data to R, R is a software tool that can be employed to compliment and enhance Excel reports. R, with tidyvrse libraries, can help further optimize data analysis by adding capabilities such as pulling advanced statistical summaries, enabling reproducability and automation of reports, modeling and generating visual summaries that overcome some of Excel’s limitations.
On this blog I will go over three Excel reports to demonstrate how R (tidyverse libraries) is used to convert Excel report to a “tidy” format. “Tidying is structuring datasets to facilitate analysis” read here about tidy and generate a visual summaries that compliment the report, and I believe easier to understand. The main objective of this blog is to demonstrate Excel formatted data conversation to to tidy format using R’s tidyverse libraries.
I will be using publicly available Excel data from [Toronto Transport Commission Ridership], a typical [cash flow] Excel report and Excel report from 2017 [Australia statistics Bureau] survey regarding marriage. This three reports were used as case study in the excellent Spreadsheet Munging Strategies book by Duncan Garmonsway, which employs the tidyxl and unpivotr packages. Although this two packages are very good and helpful working with Excel, I will be using strictly tidyverse for all Excel munging on this blog.
For each example, we will identify the Excel report and label the areas we will need to consider for munging, and generate an R script that will render a tidy-data-table, that I will use to generate a visual graphic summarization plot. Of course, once data is in tidy format, it makes it easier to extract additional knowledge applying empirical and statistical methods, extending the analysis for additional insights. But for this blog, we will limit to visual summarization.
If you want to follow along, download all relevant unprocessed datahere
The Toronto Transport Commission (TTC) generates statistical data of ridership which can be downloaded from here, periodically, for public consumption in Excel. When you open the Excel, you can see that, while the table is formatted for human readers, it isn’t in a tidy format. To turn the Excel into tidy format, a number of transformation will have to be implemented. With out further ado, lets jump right into the steps with the TTC Excel.
I use the As you can see on Figure 3.1 there are ten items to consider before arriving to a tidy format. While the header and footers, blank cells, variable name liked ‘Adult & Bus’ in data space, NA values and subtotal make it easier to read, they are not needed in a tidy format. I have used a tidyverse script to mung the Excel data, that take into account each of ten items highlighted (follow the note on the script for details).
Figure 3.1: Toronto Transit Excel data annotated.
The first step in starting the process is importing the Excel data into R system. I use the readxl library that imports ‘.xls’ and ‘.xlsx’ Excel files. The readxl library provides several useful features (technically referred to as ‘arguments’) that reduces the amount of tasks used to import and transform Excel data. For example, if there are multiple sheets on the spreadsheet, you can specify which sheet you want to work from with the sheet argument, if you want to skip the first n rows, you can use the skip argument, specify cell with cell_rows, columns with range arguments, and also be able to define the column names and type with built in arguments. I will not dive into details about every function in the readxl library, but it is worth reading and understanding the arguments from the help pages when importing Excel data.
In addition to the readxl library, I have used data carpentry libraries in tidyverse including stringr, dplyr and tidyr. You will notice in the script functions such as select, filter, mutate, fill and slice are used to clean, reshape, modify and render the tidy data.
( _ Note: read the notes following # to follow along the transformation._ )
# Import the Excel data and use the "N/A" marker for missing cell values and skip the first 4 rows and start with the 5th.
toronto_transit <- read_excel("data/toronto_transit.xlsx", na = "N/A", skip = 4)
# rename the first 2 columns, append fy on the rest of the columns
df <-
toronto_transit %>%
select_all(~str_replace(., "^", "fy_")) %>%
select(type = 1, fare_media = 2, c(3:35))
# Tidy data
# the set of instuction will complete unfilled cells, remove the lines that contain "TOTAL", replace na values with 0,
# create a features that will relable values to make it easier to understand, select the features we need to generate the data,
# and generate the tidy data table that we will use for analysis.
df_tidy <-
df %>%
fill(type) %>%
slice(1:50) %>%
filter(!str_detect(fare_media, "TOTAL")) %>%
mutate_at(vars(fy_2017:fy_1985), ~replace(., is.na(.), 0)) %>%
mutate(type_of = case_when(fare_media == "ADULT" ~ "adult",
fare_media == "SENIOR/STUDENT" ~ "senior_student",
fare_media == "CHILDREN" ~ "children",
fare_media == "BUS" ~ "the_bus",
fare_media == "RAIL" ~ "rail",
fare_media == "WEEKDAY" ~ "day_of_week"
)) %>%
select(type, fare_media, type_of, fy_2017:fy_1985) %>%
fill(type_of) %>%
mutate(line_num = seq(1:n())) %>%
filter(line_num != 33) %>% # special case for the BUS
filter(!fare_media %in% c("ADULT","SENIOR/STUDENT", "CHILDREN", "RAIL")) %>%
select(-line_num) # done its job - thank you!
# Let us now display the TTC tidy data in a table
knitr::kable( df_tidy,
caption = 'The Troronto Ridership data in tidy format.',
booktabs = TRUE
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center") %>%
scroll_box(width = "800px", height = "500px")
| type | fare_media | type_of | fy_2017 | fy_2016 | fy_2015 * | fy_2014 | fy_2013 | fy_2012 | fy_2011 | fy_2010 | fy_2009 | fy_2008 | fy_2007 | fy_2006 | fy_2005 | fy_2004 | fy_2003 | fy_2002 | fy_2001 | fy_2000 | fy_1999 | fy_1998 | fy_1997 | fy_1996 | fy_1995 | fy_1994 | fy_1993 | fy_1992 | fy_1991 | fy_1990 | fy_1989 | fy_1988 | fy_1987 | fy_1986 | fy_1985 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WHO | TOKENS | adult | 76106 | 102073 | 110945 | 111157 | 112360 | 117962 | 124748 | 120366 | 114686 | 94210 | 69134 | 75340 | 82162 | 80859 | 80330 | 82102 | 83771 | 82218 | 83028 | 85303 | 86991 | 87857 | 87775 | 97877 | 104016 | 114064 | 111365 | 119538 | 114874 | 122180 | 127088 | 126217 | 128207 |
| WHO | TICKETS | adult | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1298 | 8807 | 34445 | 65398 | 68546 | 73151 | 72952 | 71485 | 74578 | 70930 | 66331 | 64109 | 66490 | 66177 | 67164 | 70369 | 62700 | 57710 | 53655 | 35788 | 38369 | 37401 | 39514 | 38944 | 42052 | 48793 |
| WHO | TWO-FARE | adult | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4329 | 12525 | 13265 | 12894 | 11201 | 17927 | 22313 | 27025 | 18837 | 8976 | 7347 | 0 |
| WHO | PRESTO - SINGLE RIDE | adult | 67829 | 27397 | 13323 | 9862 | 8194 | 4399 | 1139 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHO | PRESTO - SRVM TOKEN RIDE | adult | 1271 | 1157 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHO | PRESTO - SRVM CASH RIDE | adult | 821 | 582 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHO | PRESTO - MONTHLY PASS | adult | 1613 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHO | REGULAR MONTHLY PASS | adult | 174549 | 194820 | 204509 | 214932 | 213982 | 205086 | 194928 | 203101 | 208172 | 203313 | 195001 | 171314 | 140594 | 125836 | 119681 | 116805 | 118176 | 112081 | 103447 | 98473 | 91521 | 86549 | 96803 | 96907 | 100607 | 109509 | 108148 | 116610 | 113506 | 119264 | 109151 | 101901 | 94970 |
| WHO | POST-SECONDARY PASS | adult | 52721 | 51861 | 48396 | 42855 | 38426 | 35019 | 32091 | 9200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHO | TWIN-GO PASS | adult | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 762 | 1118 | 1105 | 1430 | 1649 | 1592 | 1652 | 1976 | 2111 | 2514 | 2924 | 3235 | 2758 | 0 | 0 | 0 | 0 | 0 |
| WHO | WEEKLY PASS | adult | 6653 | 7547 | 8843 | 9361 | 9557 | 10185 | 9893 | 9237 | 8738 | 7517 | 7126 | 5413 | 1296 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHO | CASH | adult | 36045 | 41536 | 48873 | 49120 | 48623 | 46467 | 43795 | 43149 | 41445 | 39408 | 36317 | 38684 | 47521 | 55172 | 51328 | 51052 | 58400 | 61539 | 54835 | 49658 | 46209 | 32642 | 20930 | 20708 | 22131 | 23696 | 60034 | 67296 | 65665 | 66872 | 75308 | 66475 | 63986 |
| WHO | MONTHLY PASS | senior_student | 27324 | 27621 | 25092 | 23064 | 20509 | 19769 | 18590 | 17169 | 15331 | 14864 | 14506 | 12931 | 11068 | 9940 | 10586 | 11123 | 12397 | 11785 | 10124 | 9419 | 8647 | 9098 | 11028 | 10716 | 10508 | 10389 | 5850 | 5117 | 4570 | 4541 | 3855 | 3288 | 2519 |
| WHO | WEEKLY PASS | senior_student | 1011 | 959 | 672 | 515 | 540 | 624 | 702 | 814 | 874 | 780 | 686 | 372 | 93 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHO | TICKETS | senior_student | 31195 | 32997 | 32595 | 33408 | 35472 | 37039 | 38299 | 38674 | 38615 | 39097 | 40181 | 40808 | 42746 | 41562 | 41844 | 44018 | 44012 | 43885 | 44263 | 46559 | 48306 | 52852 | 58515 | 56539 | 56625 | 57082 | 56559 | 61358 | 61837 | 65708 | 65927 | 62963 | 61193 |
| WHO | TWO-FARE | senior_student | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 93 | 63 | 0 | 0 | 0 |
| WHO | PRESTO - SINGLE RIDE | senior_student | 5703 | 1421 | 438 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHO | PRESTO - SRVM CASH RIDE | senior_student | 253 | 210 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHO | PRESTO - MONTHLY PASS | senior_student | 26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHO | CASH | senior_student | 12532 | 10440 | 12170 | 12037 | 8538 | 8164 | 7609 | 5856 | 5526 | 5253 | 4211 | 4581 | 6549 | 7602 | 6759 | 6440 | 7507 | 7921 | 7481 | 7197 | 7665 | 8114 | 5202 | 4772 | 4703 | 3725 | 6030 | 6324 | 6267 | 6359 | 7449 | 10852 | 13117 |
| WHO | FREE RIDES | children | 24856 | 21875 | 10939 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHO | TICKETS | children | 0 | 0 | 1066 | 7097 | 7563 | 7929 | 8304 | 8287 | 8562 | 8782 | 8959 | 8879 | 8143 | 7573 | 7915 | 8869 | 9133 | 9401 | 9186 | 9640 | 9667 | 10129 | 11338 | 12192 | 12193 | 11569 | 11722 | 12417 | 12261 | 13194 | 12581 | 12074 | 10884 |
| WHO | PRESTO - FREE CHILD RIDE | children | 163 | 36 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHO | CASH | children | 0 | 0 | 526 | 3705 | 2708 | 2589 | 2433 | 2539 | 2410 | 2253 | 1933 | 2168 | 3916 | 4514 | 4072 | 3938 | 3999 | 4203 | 4209 | 4344 | 4313 | 3674 | 2877 | 2717 | 2486 | 1860 | 2546 | 3071 | 2916 | 3175 | 3526 | 3844 | 4768 |
| WHO | DAY/VIST./OTHER | children | 6728 | 9130 | 8561 | 10033 | 11428 | 11929 | 10642 | 10605 | 10880 | 9961 | 9636 | 9194 | 7598 | 6191 | 5817 | 9685 | 4943 | 4837 | 5576 | 5236 | 4611 | 4234 | 4833 | 4764 | 4228 | 1853 | 2572 | 972 | 1312 | 641 | 612 | 649 | 630 |
| WHO | BLIND/WAR AMPS | children | 1086 | 1088 | 1086 | 1119 | 1109 | 1086 | 1060 | 1073 | 1074 | 1092 | 1094 | 1025 | 1026 | 1065 | 1113 | 1275 | 1223 | 1138 | 1131 | 1137 | 1057 | 1057 | 1059 | 1119 | 1122 | 1127 | 1093 | 1098 | 1139 | 1252 | 1593 | 1506 | 1272 |
| WHO | PREMIUM EXPRESS | children | 448 | 474 | 490 | 451 | 401 | 372 | 344 | 322 | 313 | 310 | 295 | 259 | 260 | 278 | 280 | 282 | 339 | 330 | 301 | 307 | 310 | 322 | 425 | 437 | 417 | 257 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHO | POSTAL CARRIERS | children | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 58 | 702 | 700 | 664 | 683 | 719 | 753 | 790 | 783 | 902 | 1107 | 1122 | 1085 | 1331 | 1340 | 1298 | 1993 | 1860 | 1875 | 1874 | 1844 | 1821 |
| WHO | GTA PASS | children | 4283 | 4855 | 5471 | 6087 | 5784 | 5388 | 5642 | 5667 | 5800 | 5415 | 5292 | 4972 | 4395 | 3855 | 3538 | 3927 | 3326 | 3031 | 2683 | 2494 | 1915 | 1650 | 1375 | 343 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| WHERE | BUS | the_bus | 261113 | 252899 | 238943 | 245292 | 239968 | 234582 | 223269 | 219855 | 218545 | 215997 | 216341 | 206526 | 209891 | 201954 | 195817 | 207392 | 205762 | 203234 | 196785 | 195720 | 196056 | 188070 | 192497 | 191020 | 189060 | 192495 | 189150 | 202681 | 198078 | 208057 | 203400 | 188912 | 192690 |
| WHERE | SUBWAY | rail | 213012 | 221622 | 228129 | 219849 | 217250 | 216101 | 213280 | 199131 | 199321 | 196004 | 191338 | 181736 | 174046 | 169501 | 163736 | 163446 | 164214 | 158277 | 150531 | 147369 | 143751 | 142081 | 148349 | 151214 | 153020 | 158658 | 164714 | 179815 | 177302 | 178577 | 174963 | 166892 | 161914 |
| WHERE | S.R.T. | rail | 3177 | 2951 | 3352 | 4254 | 4661 | 4667 | 4766 | 4232 | 4300 | 4639 | 4700 | 4166 | 4440 | 4117 | 3612 | 4128 | 3757 | 3480 | 2835 | 2514 | 2544 | 2317 | 2633 | 2951 | 3465 | 2672 | 2854 | 3518 | 3567 | 3001 | 3361 | 3248 | 2099 |
| WHERE | TROLLEY COACH | rail | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1805 | 1077 | 13814 | 16117 | 15403 | 15217 | 17828 | 17771 | 17355 |
| WHERE | STREETCAR | rail | 55914 | 60607 | 63581 | 65420 | 63315 | 58657 | 58904 | 54139 | 49067 | 50060 | 47390 | 52116 | 42843 | 42527 | 42247 | 40573 | 46260 | 45567 | 42442 | 43086 | 37532 | 39962 | 44673 | 43067 | 46135 | 49349 | 53635 | 57103 | 56376 | 58623 | 57332 | 64189 | 58102 |
| WHEN | WEEKDAY | day_of_week | 424155 | 424117 | 423808 | 423269 | 416297 | 406913 | 395578 | 379810 | 374908 | 374765 | 368917 | 357814 | 348600 | 340549 | 332551 | 339780 | 343611 | 334198 | 320592 | 317972 | 311422 | 305983 | 318839 | 319087 | 324529 | 334515 | 350714 | 379987 | 374024 | 385093 | 382917 | 369295 | 358947 |
| WHEN | WEEKEND/HOLIDAY | day_of_week | 109061 | 113962 | 110197 | 111546 | 108897 | 107094 | 104641 | 97547 | 96325 | 91935 | 90852 | 86730 | 82620 | 77550 | 72861 | 75759 | 76382 | 76360 | 72001 | 70717 | 68461 | 66447 | 69313 | 69165 | 68956 | 69736 | 73453 | 79247 | 76702 | 78382 | 73967 | 71717 | 73213 |
Once we have converted the Excel into a tidy data, we can generate all hosts of statistical summaries. On this case, I will modify the data further and generate a data visualization with ggplot2, another powerful data visualization package from tidyverse. The rendered graphics summarizes the 32 year TTC data so that we can easily see the visual summary side by side. I will limit the analysis to generating three plots (See Figure 3.2, each one of the figures summaries different parts of the data, all in one pane.
1. The Toronto public transport ridership trend over 32 years(’85-’17).
2. Bus and Rail trend over the same 32 years period.
3. And the trend or ridership by age group.
# Vizualizelabel_number_si
gg_ttc <-
df_tidy %>%
pivot_longer(cols = fy_2017:fy_1985) %>%
group_by(type, type_of, name) %>%
summarise(total = mean(value)) %>%
ungroup() %>%
mutate_all(~str_replace_all(., "^fy_", "")) %>%
mutate_each(list(as.numeric), 3:4) %>%
mutate(type_two = case_when(type == "WHEN" ~ "Number of Riders (when)",
type == "WHERE" ~ "Mode of transport Bus or Rail(where)",
type == "WHO" ~ "Adult/Seniors_students/Children (Who)"
)) %>%
ggplot(aes(x = name, y = total, group = type_of)) +
geom_line(aes(col = type_of), size = 2) +
scale_y_continuous(labels = scales::number_si, breaks = seq(2000, 270000, by = 20000)) +
scale_x_continuous(breaks = seq(1985, 2019, by = 5)) +
facet_wrap(type_two ~., scales = "free_y", nrow = 1) +
theme_abiyu() +
theme(legend.position = "right",
legend.title=element_blank()) +
labs(title = "1985-2017 Toronto Public transport Ridership Trend",
#subtitle = "",
caption = "Source: TTC | summary &Viz: @abiyugiday",
x = "years",
y = "total rider ship")
gg_ttc
Figure 3.2: TTC transit trend from 1985-2017
The Cash Flow Excel sheet is an example of what a typical cash flow report might look like. Again, we are going to apply the same workflow as TTC from section 3. We will use readxl to import the data into R, identify the items we want to modify to turn it into tidy format, and generate a script that will address the identified items and finally generate a visual graphics that will summarize the data.
As you can see on Figure 4.1, I have identified 7 items that we will need to consider on the cash flow Excel data. This includes, removing the header and footer, remove the empty columns in between months, adding a feature for the major expense types (cash inflow, cash outflow and other cash out flow) and removing total amounts from the Excel.
Figure 4.1: Cash flow annotated.
Again, we start by importing the Excel into R with the read_excel function. The import will skip the first three rows removing the header that we will not use. We then use the scoped variant functions of the tidyverse to clean the data, and use the select, mutate and filter functions to transform the data.
( Note: read the notes following # in the code to follow along the transformation. )
# Import Excel and skip the first 3 raws, which remove the header
cash_flow_1 <- read_excel("data/cash_flow_1.xlsx", skip = 3)
# Use the imported data, remove all empty columns, remove character such as barackets, colons and asterisk, create a new vriable and coppy values, choose the relvant variables, remove rows that contain only missing values (na) and remove columns that will not be used to create a tidy format (ex the Excel summary values that contain 'TOTAL)
df_cashFlow_tidy <-
cash_flow_1 %>%
select_if(~!all(is.na(.))) %>%
# inflow clean up
mutate_all(~str_replace_all(., "\\(", "")) %>%
mutate_all(~str_replace_all(., "\\)", "")) %>%
mutate_all(~str_replace_all(., ":", "")) %>%
mutate_all(~str_replace_all(., " ", "_")) %>%
mutate_all(~str_replace_all(., "\\*", "")) %>%
#tidy step
select(type_1 = 1, type_2 =2 , everything()) %>%
mutate(type_of = case_when(type_1 == "Cash_Inflows_Income" ~ "Cash_Inflows_Income",
type_1 == "Cash_Outflows_Expenses" ~ "Cash_Outflows_Expenses",
type_1 == "Other_Cash_Out_Flows" ~ "Other_Cash_Out_Flows"
)) %>%
select(type_1, type_2, type_of, everything()) %>%
fill(type_of) %>%
tidyr::drop_na(type_2) %>%
mutate_all(tolower) %>%
filter(!stringr::str_detect(type_2, "total")) %>%
select(-type_1, -TOTALS)
# We will now render and print out a formatted tidy data as seen in Table 4.1
knitr::kable( df_cashFlow_tidy,
caption = 'The Troronto Ridership in tidy format.',
booktabs = TRUE
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center") %>%
scroll_box(width = "800px", height = "500px")
| type_2 | type_of | April | May | June | July | Aug | Sept | Oct | Nov | Dec | Jan | Feb | Mar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cash_collections | cash_inflows_income | 4419 | 1198 | -869 | -769 | 3875 | -3059 | -3164 | -4995 | 2646 | -3945 | -1894 | 1690 |
| credit_collections | cash_inflows_income | -1698 | 4660 | 2814 | 4607 | -943 | 2582 | 963 | 1997 | -3995 | 4792 | -839 | 422 |
| investment_income | cash_inflows_income | -4260 | 4307 | -684 | 4888 | -3672 | 2330 | -4344 | -2361 | -924 | 4614 | -2765 | 1750 |
| other | cash_inflows_income | -3231 | -4810 | -3050 | 4442 | -1081 | 1268 | 2456 | -4041 | -4616 | 4478 | 2514 | -3159 |
| advertising | cash_outflows_expenses | 3295 | 4587 | 3432 | 2983 | -2939 | -2683 | -4333 | 4052 | 2874 | -4376 | -1353 | -4956 |
| bank_service_charges | cash_outflows_expenses | 136 | -3635 | -3901 | -802 | -4306 | 2050 | -841 | 555 | -2577 | -743 | 3414 | -3909 |
| insurance | cash_outflows_expenses | -2360 | -2477 | 1148 | -3858 | 3518 | 4453 | -4665 | -2059 | -2045 | 327 | 896 | -2656 |
| interest | cash_outflows_expenses | 4497 | -93 | -4514 | 2501 | 4345 | -4023 | 1616 | 4465 | -4967 | -3901 | 1266 | -4184 |
| inventory_purchases | cash_outflows_expenses | -764 | 2915 | -4563 | -3212 | -3851 | 1385 | -2026 | -3229 | -4869 | -3118 | 4539 | -698 |
| maintenance_&_repairs | cash_outflows_expenses | 985 | -2879 | -3116 | -4854 | 3715 | -3231 | -3704 | -3653 | 2 | 3204 | -4880 | -2242 |
| operating_supplies | cash_outflows_expenses | 3530 | 1674 | 4467 | -370 | 4533 | -3254 | -164 | -1731 | -4495 | -394 | -3415 | 1236 |
| payroll | cash_outflows_expenses | -3249 | 1973 | 2359 | 1528 | 1963 | -758 | -2437 | -868 | 2931 | -3798 | -1279 | 2765 |
| payroll_expenses | cash_outflows_expenses | 2786 | -3061 | 138 | 2356 | -4213 | -4661 | 3296 | -1736 | 662 | 2401 | -3032 | 4176 |
| sales_commissions | cash_outflows_expenses | -2983 | -4151 | -1732 | 695 | -545 | 1561 | 382 | 2175 | 3617 | -1627 | 2027 | -2025 |
| professional_fees | cash_outflows_expenses | -4487 | 4047 | 783 | -4517 | 1103 | 776 | 974 | -1764 | -3889 | 4451 | -3238 | -1509 |
| communications_equip. | cash_outflows_expenses | 2804 | 2664 | -3194 | 2986 | 378 | 999 | -2573 | -1119 | 1180 | -1663 | 4444 | 1549 |
| rent | cash_outflows_expenses | 3449 | -2594 | -3879 | 1591 | -3892 | 3706 | 1439 | -4092 | 2842 | -4741 | 4013 | 4912 |
| office_supplies | cash_outflows_expenses | 2335 | 1633 | -232 | 2671 | -3212 | 4593 | 2614 | -2096 | -847 | -3336 | 1876 | 2579 |
| permits_&_licenses | cash_outflows_expenses | 2129 | -2288 | 4199 | 3191 | -2516 | 534 | -4839 | -1453 | 1496 | -1000 | -4695 | -3374 |
| utilities_&_telephone | cash_outflows_expenses | -2947 | -1485 | 4847 | -2918 | -887 | 316 | -592 | 4877 | -739 | -4195 | 1619 | 4061 |
| travel | cash_outflows_expenses | -2093 | 2709 | -4053 | 1227 | 4932 | -1445 | -4039 | -2004 | -3570 | -250 | -150 | 316 |
| taxes | cash_outflows_expenses | -859 | 175 | -143 | -10 | 846 | 185 | -3101 | 3151 | -2675 | 3488 | -2132 | 3622 |
| other | cash_outflows_expenses | 998 | -2835 | -988 | -1206 | 2069 | -627 | 2095 | -4390 | 2580 | -4456 | -3033 | 4786 |
| sales_training | cash_outflows_expenses | 1623 | 457 | 3136 | 1675 | -2873 | -3310 | -2990 | 2721 | 957 | 3453 | 4799 | -3701 |
| capital_purchases | other_cash_out_flows | 1901 | 3811 | -1270 | 2393 | -3280 | 2252 | -968 | -1413 | -1794 | 1632 | -939 | -2306 |
| loan_principal | other_cash_out_flows | -2555 | -4169 | -4517 | -4410 | 1893 | -2379 | 3541 | 2526 | 3118 | 509 | 251 | 157 |
| owner’s_draw | other_cash_out_flows | 1818 | 1566 | -2228 | 2396 | -3597 | -1463 | 3439 | 4057 | -1076 | -4961 | 588 | 3896 |
| other | other_cash_out_flows | -4413 | -595 | -2891 | 3857 | -4781 | 4794 | -897 | -4500 | -4005 | 4881 | 378 | -425 |
We can generate different types of visualization for the cash flow data, but on this example I will create small multiple line chart for all the items for inflow income, outflow income and other outflow expenses. All together, there are 26 categories and we want to be able to see he ebbs-and-flows of each category on its own, and plot them in a single pan so that it is easier for side by side comparison.
# Again, I have modified the data further by selecting the months _FY Apr - Sept_,
# and pivot the data to a long form to preare it for the plot
# The plot used the facet_wrap function to genrate the small multiples.
df_cashFlow_tidy %>%
mutate_each(list(as.numeric), April:Mar) %>%
pivot_longer(cols = April:Mar) %>%
ggplot(aes(x = name, y = value, group = type_2)) +
geom_line() +
facet_wrap(~ type_2, scales = "free_y") +
scale_y_continuous(labels = scales::number_si) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme_abiyu() +
labs(x ="", y = "")
Figure 4.2: 12 Months of Cash flow trends by category.
“The Australian Marriage Law Postal Survey was a national survey designed to gauge support for legalizing same-sex marriage in Australia. The survey was held via the postal service between 12 September and 7 November2017.” The result was released by Australia Bureau of Statistics in Excel. There are several tabs, but we are going to focus on table 1.
The Excel released is again is made easier to read, but not in tidy format needed to extend the analysis. Following the workflow established for TTC and cash flow Excel, I have identified eight items (see Figure 5.1 that will need to be addressed to convert the data into tidy format. The items include removing the header and footer, removing empty columns and raws, handling merged cells (for yes/no) and response types (response clear/eligible participants).
Figure 5.1: Australia Marriage Survey Excel annotated.
We will again import the data using the read_excel function from the readxl library. This time, we will deal with the header with slice function just to show different way of handling the data. The resulting tidy data will contain both the response clear participants and the eligible participants.
( Note: read the notes following # in the code to follow along the transformation )
# Import the Excel - notice the header is not skipped
ozmarriage_table1 <- read_excel("data/ozmarriage_table1.xlsx")
# tidy data: We use the `select` function to rename each of the variables with type of vote in the header.
# We then pivot the data to long format and removed empty rows
# Pivot to wider format again for easier display and made sure the number class is set to numeric.
oz_mrg_surv_2017 <-
ozmarriage_table1 %>%
select( states = 1,
rep_clr_yes = 2,
rep_clr_no = 4,
rep_clr_tot = 6,
elg_resp_nc = 11,
elg_non_rsp = 13,
elg_tot = 15
) %>%
slice(4:15) %>%
pivot_longer(cols = rep_clr_yes:elg_tot) %>%
drop_na() %>%
pivot_wider(names_from = "name", values_from = "value") %>%
mutate_each(list(as.numeric), rep_clr_yes:elg_tot)
write.csv(oz_mrg_surv_2017, file = "data/oz_mrg_surv_2017.csv", row.names = FALSE)
# We will now render and print out a formatted tidy data as seen in Table 5.
knitr::kable( oz_mrg_surv_2017,
caption = 'The Troronto Ridership in tidy format.',
booktabs = TRUE
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
| states | rep_clr_yes | rep_clr_no | rep_clr_tot | elg_resp_nc | elg_non_rsp | elg_tot |
|---|---|---|---|---|---|---|
| New South Wales | 2374362 | 1736838 | 4111200 | 11036 | 1065445 | 5187681 |
| Victoria | 2145629 | 1161098 | 3306727 | 11028 | 743634 | 4061389 |
| Queensland | 1487060 | 961015 | 2448075 | 7088 | 695710 | 3150873 |
| South Australia | 592528 | 356247 | 948775 | 2778 | 242027 | 1193580 |
| Western Australia | 801575 | 455924 | 1257499 | 3188 | 346333 | 1607020 |
| Tasmania | 191948 | 109655 | 301603 | 805 | 77020 | 379428 |
| Northern Territory(b) | 48686 | 31690 | 80376 | 229 | 57496 | 138101 |
| Australian Capital Territory(c) | 175459 | 61520 | 236979 | 534 | 50595 | 288108 |
| Australia | 7817247 | 4873987 | 12691234 | 36686 | 3278260 | 16006180 |
Now that we have the tidy data, we can extract insight and generate all host of visualization. But to keep it light, I chose to focus on the yes or no response voters distribution, for each of the eight territories and the total survey results of Australia. Bar chart is good choice to display categorical data, and so we will generate small multiple for each territory and the whole Australia in a single pane with ggplot facet function. But before generating the small multiple, , the tidy data is transformed further to generate the desired bar plot.
( Note: read the inline comment in the code for explanation )
# transoform the data = select the variables that will be used, pivot the data to long format, count and add variable that contains the calculated percentile for each small multiple yes and no vote, and use the perctile as lable on top of the bar chart.
gg_1 <-
oz_mrg_surv_2017 %>%
select(states, rep_clr_yes, rep_clr_no) %>%
pivot_longer(cols = rep_clr_yes:rep_clr_no ) %>%
group_by(states) %>%
mutate(prcnt = scales::percent(value/sum(value))) %>%
ggplot(aes(x = name, y = value, fill = name)) +
geom_bar(stat="identity", position = "dodge",color="black") +
facet_wrap(~ states, scales = "free_y") +
scale_y_continuous(labels = scales::number_si) +
geom_text(aes(label = prcnt),
position = position_dodge(0.9),
vjust = 2,
colour = "white",
size = 4,
fontface = "bold") +
scale_fill_manual(name = "", labels = c("Respn Clear No", "Respns Clear Yes"), values=c("palevioletred2", "royalblue2")) +
theme_minimal() +
labs(x ="", y ="",
title = "Australians support Marriage law change",
subtitle = "2017 Australia Marriage Law Survey",
caption = "Data Source: Australian Buraue of Statstics") +
theme_abiyu()
gg_1
Figure 5.2: 2017 OZ Marriage Survey Result & Choropleth Map
One of the capabilities R bring in to Excel is applying data to a geospatial map, allowing the generation of thematic map (aka Choropleth map). Thematic maps allow for scaled color of a map according to a numeric value. Here I will have merged Australia’s states geocode with a tidy survey data result values. Covering how to do that is out of the scope of this blog. R has a number of libraries that will provide plotting a map. simple feature is “standardized way to encode spatial vector data”, and one that is recommended with robust features to generate thematic maps. you can read more about it here. To keep all what we need with tidyverse, I am going to use the ggplot2 data frame to generate the thematic map.
I will start with the tidy survey data (df_oz_vote) we have created from the Excel earlier. Use the dplyr verbs select, filter, pivot and mutate to transform the data and merge it with the center of each state, and use ggplot to render the plot, see on Figure 5.3. The choropleth map of Australia is generated with darker color indicating higher percentage yes or no votes, and lighter color indicating lower percentage yes or no vote. The percentage text is labeled for each state, and you can easily compare each state’s yes/no vote side by side. Most Australians voted yes! As the ol’ adage goes “A picture is worth a thousand words!
# get the tidy survey data
df_oz_vote <- read_csv("data/df_oz_vote.csv")
# The following script, selects the yes and no votes for all states - save for total Australia,
# and clean the names of the states to match with the center data
df_yes_no <-
oz_mrg_surv_2017 %>%
select(admin_1 = 1, prcnt_yes = 2, prcnt_no = 3) %>%
pivot_longer(cols = prcnt_yes:prcnt_no ) %>%
filter(admin_1 != "Australia") %>%
mutate(prcnt = scales::percent(value/sum(value))) %>%
mutate(admin_1 = tolower(admin_1)) %>%
mutate(admin_1 = str_replace_all(admin_1, "\\(b\\)", "")) %>%
mutate(admin_1 = str_replace_all(admin_1, "\\(c\\)", ""))
# A function to calculate the center of the terrotories from the goecode
mid_range <- function(x) mean(range(x, na.rm = TRUE))
# Use plyr library ddply function to generate center (close to center)
center <- plyr::ddply(df_oz_vote, .(region), colwise(mid_range, .(long, lat)))
# Merge the two data frames to generate a label data frame.
df_label <- center %>%
mutate(admin_1 = str_replace_all(region, "^state of ", "")) %>%
left_join(df_yes_no, by = "admin_1") %>%
select(-region) %>%
arrange(desc(name))
# plot the vis no and a yes
ggplot() +
geom_polygon(data = df_oz_vote, aes(x = long, y = lat, group = group, fill = value),
linetype = "solid", col = "gray56") + facet_wrap(~ name) +
geom_text(data = df_label, aes(x = long, y = lat, label = prcnt), size = 3, col = "black") +
scale_fill_distiller(name = "percent %", palette = "YlOrRd", direction = 1) +
theme(legend.position = "bottom") +
labs(title = " 2017 Australia Marriage Survey - The Yes have it",
caption = "data_source: Australian Buraue of Statstics\nMap: ChoroplethrAdmin1 | by @abiyugiday") +
theme_abiyu()
knitr::include_graphics("fig/blog3_oz_choro_srvy_vote.png")
Figure 5.3: Australia Marriage Survey thematic map
Excel remains one of the most useful and powerful application to organize and work with data. With explosion of data across industries, Excel reports are produced at a faster clip than ever. As demonstrated above, R is ideal complimentary software that will extend and enhance Excel reports to maximize value extraction. On this blog we covered three different types of Excel reports and discussed how to transform each to a tidy format with tidyverse libraries.
We used the tidy data sets to generate visual summaries that enable us to understand and share. But with further analysis, the tidy data can answer a lot of questions, which would have been difficult otherwise. The analysis scripts can be easy or complicated, depending on the Excel report type, but once created they can be used to automate any further report with similar makeup, because they are reproducible.
Hope you found this blog useful! If you have any questions or comments, feel free to drop me an e-mail at abiyu.giday@datarecode.com or DM me on twitter @abiyugiday. I am also available to discuss with you should you organization need training or have Excel report you want extended with R